Tracking System Behaviour from Resource Usage Data
Resource usage data, collected using tools such as TACC Stats, capture the
resource utilization by nodes within a high performance computing system. We
present methods to analyze the resource usage data to understand the system
performance and identify performance anomalies. The core idea is to model the
data as a three-way tensor corresponding to the compute nodes, usage metrics,
and time. The reconstruction error between the original tensor and the
tensor reconstructed from a low-rank tensor decomposition serves as a scalar
performance metric, which enables us to monitor the performance of the system
online. This error statistic is then used for anomaly detection, which
relies on the assumption that the normal (routine) behavior of the system can
be captured by a low-rank approximation of the original tensor. We evaluate
the performance of the algorithm using information gathered from system logs
and show that the performance anomalies identified by the proposed method
correlate with critical errors reported in the system logs. Results are shown
for data collected during 2013 from the Lonestar4 system at the Texas Advanced
Computing Center (TACC).
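The reconstruction-error idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which tensor decomposition is used, so a truncated higher-order SVD (a Tucker-style decomposition) is swapped in here, and the tensor sizes, ranks, and synthetic data are all hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Matricize a tensor so that rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def hosvd_factors(tensor, ranks):
    """Leading left singular vectors of each mode unfolding (truncated HOSVD)."""
    factors = []
    for mode, rank in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(u[:, :rank])
    return factors

def low_rank_approximation(tensor, factors):
    """Project every mode of the tensor onto the subspace spanned by its factor."""
    approx = tensor
    for mode, u in enumerate(factors):
        projected = np.tensordot(u @ u.T, approx, axes=(1, mode))
        approx = np.moveaxis(projected, 0, mode)
    return approx

def reconstruction_error(tensor, ranks):
    """Relative Frobenius error between a tensor and its low-rank approximation."""
    approx = low_rank_approximation(tensor, hosvd_factors(tensor, ranks))
    return np.linalg.norm(tensor - approx) / np.linalg.norm(tensor)

# Hypothetical synthetic data: a (nodes x metrics x time) window that is
# rank one up to small noise, standing in for routine system behavior.
rng = np.random.default_rng(0)
nodes, metrics, steps = 20, 6, 30
a = rng.uniform(0.5, 1.5, nodes)
b = rng.uniform(0.5, 1.5, metrics)
c = rng.uniform(0.5, 1.5, steps)
normal = np.einsum("i,j,k->ijk", a, b, c)
normal = normal + 0.01 * rng.standard_normal((nodes, metrics, steps))

# The same window with one node misbehaving for ten time steps.
anomalous = normal.copy()
anomalous[3, :, 10:20] += 2.0 * rng.standard_normal((metrics, 10))

err_normal = reconstruction_error(normal, ranks=(1, 1, 1))
err_anomalous = reconstruction_error(anomalous, ranks=(1, 1, 1))
print(f"normal window error:    {err_normal:.4f}")
print(f"anomalous window error: {err_anomalous:.4f}")
```

In this sketch the routine window is reconstructed almost exactly, while the window containing the misbehaving node leaves a much larger residual, which is the property the scalar error statistic relies on.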
dynamicMF: A Matrix Factorization Approach to Monitor Resource Usage in High Performance Computing Systems
High performance computing (HPC) facilities consist of a large number of
interconnected computing units (or nodes) that execute highly complex
scientific simulations to support scientific research. Monitoring such
facilities, in real-time, is essential to ensure that the system operates at
peak efficiency. Such systems are typically monitored using a variety of
measurement and log data which capture the state of the various components
within the system at regular intervals of time. As modern HPC systems grow in
capacity and complexity, the data produced by current resource monitoring tools
has reached a scale at which visual monitoring by analysts is no longer
feasible. We propose a method that transforms the multi-dimensional output of
resource monitoring tools to a low dimensional representation that facilitates
the understanding of the behavior of an HPC system.
The proposed method automatically extracts the low-dimensional signal in the
data which can be used to track the system efficiency and identify performance
anomalies. The method models the resource usage data as a three-dimensional
tensor (capturing resource usage of all compute nodes for different resources
over time). A dynamic matrix factorization algorithm, called dynamicMF, is
proposed to extract a low-dimensional temporal signal for each node, which is
subsequently fed into an anomaly detector. Results on resource usage data
collected from the Lonestar 4 system at the Texas Advanced Computing Center
show that the identified anomalies are correlated with actual anomalous events
reported in the system log messages.
Comment: 11 pages
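The pipeline described in this abstract (reduce each node's multi-metric usage to a low-dimensional temporal signal, then pass that signal to an anomaly detector) can be illustrated with a small sketch. It does not implement dynamicMF, whose details are not given here: a plain truncated SVD stands in for the factorization, and a median/MAD score stands in for the anomaly detector. The function names, the threshold, and the synthetic data are hypothetical.

```python
import numpy as np

def node_signals(usage, k):
    """Map a (nodes, metrics, time) tensor to a (nodes, time, k) signal.

    Assumption: a static truncated SVD over all (node, time) rows is used
    in place of a dynamic matrix factorization.
    """
    n_nodes, n_metrics, n_time = usage.shape
    # One row per (node, time) pair, one column per metric.
    rows = usage.transpose(0, 2, 1).reshape(-1, n_metrics)
    centered = rows - rows.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[:k].T).reshape(n_nodes, n_time, k)

def flag_anomalies(signals, threshold=5.0):
    """Flag (node, time) cells whose signal magnitude is a robust outlier."""
    magnitude = np.linalg.norm(signals, axis=2)
    median = np.median(magnitude)
    mad = np.median(np.abs(magnitude - median))
    score = np.abs(magnitude - median) / (1.4826 * mad)
    return score > threshold

# Hypothetical synthetic data: every node follows the same metric profile,
# scaled by a load that fluctuates mildly over time.
rng = np.random.default_rng(1)
n_nodes, n_metrics, n_time = 10, 5, 50
profile = np.array([1.0, 0.5, 0.8, 0.3, 0.6])
load = rng.uniform(0.8, 1.2, size=(n_nodes, n_time))
usage = load[:, None, :] * profile[None, :, None]
usage = usage + 0.01 * rng.standard_normal((n_nodes, n_metrics, n_time))

# Node 4 runs at four times its usual load for five time steps.
usage[4, :, 20:25] *= 4.0

flags = flag_anomalies(node_signals(usage, k=2))
for node, step in np.argwhere(flags):
    print(f"node {node} flagged at time step {step}")
```

A dynamic factorization would update the low-dimensional basis as new measurements arrive; the static basis used here is only meant to show how the per-node signal and the detector fit together.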